Computer-aided diagnosis systems play an important role in the
classification of diseases and genes, for psychological and other diseases.
Bipolar disorder (BD) is a common psychological disease nowadays. The genes
that describe this disease may include values that are irrelevant to bipolar
disorder, and these values may adversely affect classification performance.
Logistic regression (LR) and, more recently, sparse logistic regression
(SLR) have been commonly used to solve such binary classification
problems. Gene selection has proved to be a successful technique for
improving classification output by excluding irrelevant genes. In this work,
we go further in improving classification accuracy by incorporating gene
weights, integrating the standardization of the T-test with sparse logistic
regression. A bipolar dataset of gene expressions measured for 22283 genes
using Affymetrix technology was used. Two performance indicators,
classification accuracy and the geometric mean of specificity and
sensitivity, are considered in evaluating the proposed method. Experimental
results show an improvement over the two competitor methods, SLR-smoothly
clipped absolute deviation (SCAD) and SLR-lasso, in three indicators:
classification accuracy, geo-mean, and area under the curve. Therefore, our
technique is beneficial for predicting and classifying BD patients.
1. INTRODUCTION


Bipolar disorder (BD) is a debilitating mental disease that affects approximately 4 to 7% of the US
population [1] and up to 3% of the worldwide population [2]. BD is characterized by frequent
mood alternation between two states, mania and depression. Inheritance plays a major role in BD,
and environmental factors also contribute to its pathogenesis [1]. BD is closely connected to
other illnesses such as schizophrenia or hypomania. It is difficult to differentiate BD from other
psychoses, especially unipolar disorder depression (UDD), because similar symptoms appear in
other psychological patients [2]. With early diagnosis and a good treatment plan, BD patients can
be successfully managed. In recent years, great efforts have been made to identify disease-related genes or
biomarkers that lead to early detection and treatment [1]. Correct early diagnosis is very important: the
proportion of BD patients who are misdiagnosed within the first years of treatment reaches 80%. This delay
has the consequence that some patients commit suicide. In addition, incorrect therapy may cost a large
amount of money. Preventing suicidal behavior is therefore the great gain of early detection and treatment
of mood disorders [2].
DNA microarray technology has revolutionized the field of biology and genetic research.
This technology measures the expression values of a huge number of genes concurrently. Gene
expression data obtained from a specific tissue provide useful information
for understanding the biological behavior of that tissue [3], [4]. These gene expression datasets are
used in various fields of application, such as breast, lung, and prostate cancer classification and early
detection of tumors [5], [6].


2. METHOD 
2.1. Gene expression 

No gene set is yet known to be responsible for BD. In recent years, gene expression measured with
the microarray technique has been commonly used to investigate complex disorders [2]. Gene expression
data are represented by a matrix whose rows are the samples and whose columns are the genes.
In these matrices, the number of columns (genes) is always larger than the number of rows
(samples) [7]-[10].

Because classification performance is strongly affected by the many irrelevant and redundant genes
contained in these datasets, reducing their dimensionality has become a necessity and
has received increasing attention over the last thirty years [11], [12]. Gene selection methods have been
applied to reduce the dimensionality of gene expression datasets [7], [13]. In general, these methods can be
divided into filter, wrapper, and embedded methods [14]-[16]. Filter methods are the most common of the three.
They use a dedicated criterion to score each gene individually; the scoring is carried out
before classification and does not depend on the classification method. Wrapper methods
optimize classification performance through a search guided by the classification algorithm itself.
Embedded methods perform gene selection and data classification concurrently within one technique
[17]-[20]. Compared to the other two groups, wrapper methods are considered to have lower
computational efficiency [13], [21]-[23]. Improving classification performance is an important
goal of gene selection. Gene selection downscales the high-dimensional microarray data by eliminating
unrelated genes, which speeds up classification and decreases both the risk of overfitting and
the computational time.
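As a rough illustration of the filter family described above, the sketch below scores each gene on its own, without involving any classifier, and keeps the top-ranked genes. The function name `filter_select` and the score (absolute difference of class means) are hypothetical choices made for this example, not the criterion used in this paper.

```python
from statistics import mean

def filter_select(X, y, k):
    """Rank genes independently of any classifier and keep the top k.

    X: list of samples, each a list of gene expression values.
    y: list of 0/1 class labels, one per sample.
    The score below is an illustrative stand-in for a filter statistic.
    """
    n_genes = len(X[0])
    scores = []
    for j in range(n_genes):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    # Sort gene indices by descending score and keep the k best.
    ranked = sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Toy data: 4 samples, 3 genes; only the middle gene separates the classes.
X = [[1.0, 5.0, 0.2], [1.1, 9.0, 0.1], [0.9, 5.5, 0.3], [1.0, 9.5, 0.2]]
y = [0, 1, 0, 1]
top_genes = filter_select(X, y, k=1)
```

Because the score never consults a classifier, the same ranking can be reused with any downstream classification method, which is the defining property of a filter.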


2.2. Sparse logistic regression 

Logistic regression (LR) is one of the most common techniques applied to binary classification
problems. The regression function is nonlinear in the variables (genes). The LR
response value is either one or zero; one is used for the positive cases and zero for the negative
(healthy) cases [19], [24]. LR is considered infeasible for gene expression data classification because the
gene expression matrix is singular [25].

Recently, much attention has been given to sparse logistic regression (SLR) as a classification method
that combines classical LR with a penalty in order to classify data with the fewest features or genes [14],
[24], [25]. Different penalties can be imposed to produce different SLR models. The most popular and
widely used penalty is the least absolute shrinkage and selection operator, lasso (L1-norm) [25].
Despite its popularity, SLR-lasso has drawbacks. First, from a group of highly correlated genes, SLR-lasso
selects only one gene and ignores the rest. Second, it applies the same amount of
shrinkage to every gene coefficient, which leads to inconsistent gene selection.

Let $\{y_i, t_i;\ i = 1, 2, \ldots, n\}$ be $n$ independent observations, where $y_i \in \{1, 0\}$ are the response
values (class labels), $t_i = (t_{i1}, \ldots, t_{ip})^T$ is a vector of predictor genes, and $p$ is the number of
genes. The LR model is then written as:

$$\mathrm{Prob}(y_i = 1 \mid t_i) = \pi(t_i), \tag{1}$$

$$\mathrm{Prob}(y_i = 0 \mid t_i) = 1 - \pi(t_i). \tag{2}$$

This probability can be expressed as:

$$\pi(t_i) = \frac{\exp(\alpha + t_i^T \beta)}{1 + \exp(\alpha + t_i^T \beta)}, \tag{3}$$

where $\alpha + t_i^T \beta = \alpha + t_{i1}\beta_1 + t_{i2}\beta_2 + \cdots + t_{ip}\beta_p$ and $\ln\left[\pi(t_i) / (1 - \pi(t_i))\right] = \alpha + t_i^T \beta$.
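The relation between the probability in (3) and the linear predictor can be checked numerically. The sketch below is only an illustration; the function names and the toy coefficient values are assumptions introduced for this example.

```python
import math

def logistic_probability(alpha, beta, t):
    """Return exp(eta) / (1 + exp(eta)) with eta = alpha + t . beta, as in (3)."""
    eta = alpha + sum(b * x for b, x in zip(beta, t))
    # Two algebraically equivalent forms, chosen to avoid overflow in exp().
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def log_odds(prob):
    """Logit transform: ln(prob / (1 - prob))."""
    return math.log(prob / (1.0 - prob))

# Toy values (hypothetical): two genes, one sample.
alpha, beta, t = 0.5, [1.0, -2.0], [0.3, 0.1]
prob = logistic_probability(alpha, beta, t)
eta = alpha + sum(b * x for b, x in zip(beta, t))
```

Applying `log_odds` to `prob` recovers `eta`, which is the identity stated after (3).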
The log-likelihood function with respect to the response variables $y_i$ can be written as:

$$\ell(\beta) = \sum_{i=1}^{n} \left\{ y_i(\alpha + t_i^T \beta) - \ln\left[1 + \exp(\alpha + t_i^T \beta)\right] \right\}. \tag{4}$$
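A direct transcription of the log-likelihood in (4) may help make the summation concrete. This is a minimal sketch; the function name and the toy inputs are assumptions made for illustration.

```python
import math

def log_likelihood(alpha, beta, T, y):
    """Sum over samples of y_i * eta_i - ln(1 + exp(eta_i)), eta_i = alpha + t_i . beta."""
    total = 0.0
    for t_i, y_i in zip(T, y):
        eta = alpha + sum(b * x for b, x in zip(beta, t_i))
        # ln(1 + exp(eta)) evaluated in an overflow-safe way.
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += y_i * eta - log1pexp
    return total

# With all coefficients at zero, every sample contributes -ln(2).
T = [[1.0], [2.0]]
y = [1, 0]
ll_zero = log_likelihood(0.0, [0.0], T, y)
```

A coefficient vector that fits the labels better gives a larger (less negative) value, which is what the maximization in the following equations exploits.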


The SLR-lasso adds a nonnegative penalty term to (4) to control the number of
genes in high dimensions, and is defined as:

$$\ell_{\text{lasso}}(\beta) = \sum_{i=1}^{n} \left\{ y_i(\alpha + t_i^T \beta) - \ln\left[1 + \exp(\alpha + t_i^T \beta)\right] \right\} - \lambda \sum_{j=1}^{p} |\beta_j|, \tag{5}$$

$$\hat{\beta}_{WSLR} = \arg\max_{\beta} \left\{ \sum_{i=1}^{n} \left[ y_i(\alpha + t_i^T \beta) - \ln\left\{1 + \exp(\alpha + t_i^T \beta)\right\} \right] - \lambda \sum_{j=1}^{p} |\beta_j| \right\}. \tag{6}$$

As shown in (5) [5], if the genes are assumed to be standardized, the values of the parameters $\alpha$ and $\beta$ are
obtained by maximizing the SLR-lasso objective, as shown in (6). The term $\lambda \sum_{j} |\beta_j|$ is the penalty
term that improves the estimation, and $\lambda \geq 0$ controls the degree of shrinkage. Setting $\lambda = 0$ yields
the maximum likelihood solution, while a large value of $\lambda$ increases the impact of the penalty term on the
coefficient estimates [19], [21], [26].
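To show how the penalty in (5) and (6) drives coefficients to exactly zero, the sketch below maximizes the lasso-penalized log-likelihood with proximal gradient ascent (soft-thresholding). The solver, the step size, the iteration count, and all function names are assumptions chosen for this illustration; the paper does not state which optimizer was used.

```python
import math

def sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def soft_threshold(z, thr):
    """Shrink z toward zero by thr; values within [-thr, thr] become 0."""
    if z > thr:
        return z - thr
    if z < -thr:
        return z + thr
    return 0.0

def fit_slr_lasso(T, y, lam, step=0.1, n_iter=2000):
    """Proximal gradient ascent on the penalized log-likelihood (scaled by 1/n)."""
    n, p = len(T), len(T[0])
    alpha, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals y_i - pi(t_i) are the building block of the gradient.
        resid = [y_i - sigmoid(alpha + sum(b * x for b, x in zip(beta, t_i)))
                 for t_i, y_i in zip(T, y)]
        alpha += step * sum(resid) / n  # the intercept is not penalized
        for j in range(p):
            grad_j = sum(r * t_i[j] for r, t_i in zip(resid, T)) / n
            beta[j] = soft_threshold(beta[j] + step * grad_j, step * lam / n)
    return alpha, beta

# Toy data (hypothetical): gene 0 separates the classes, gene 1 is noise.
T = [[-1.0, 0.1], [-0.8, -0.1], [0.9, 0.1], [1.1, -0.1]]
y = [0, 0, 1, 1]
_, beta_strong = fit_slr_lasso(T, y, lam=5.0)   # heavy shrinkage
_, beta_weak = fit_slr_lasso(T, y, lam=0.1)     # light shrinkage
```

With the heavy penalty every coefficient stays at zero, whereas the light penalty lets the informative gene enter the model, mirroring the role of $\lambda$ described above.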


2.3. Weighted sparse logistic regression 

The information obtained from gene measurements plays a major role in improving gene
selection. This paper proposes a weighting scheme that integrates the standardization of the T-test (S-Ttest)
into weighted sparse logistic regression (WSLR) to better identify the relevant genes. The weights are
derived from the gene measurements so that they give a better
representation of the differential expression among genes in the data. The main purpose of incorporating the
weights is to emphasize the differentially expressed genes, which allows these genes to be detected more
reliably during gene selection with sparse logistic regression. The S-Ttest weights are calculated
as in (7):


$$W_j = \frac{|T_j| - \mu}{\sigma}, \tag{7}$$

where $|T_j|$ is the absolute value of the two-sample T-test statistic for the $j$th gene, and $\mu$ and $\sigma$ are the
mean and standard deviation of the absolute T-test values over all genes. In this paper, the weight indicates
the expression difference of a gene between the sample groups. Therefore, the most differentially expressed
genes receive the largest weights and are expected to contribute most to classification accuracy. The
calculation of these weights is illustrated in Figure 1.
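The weight calculation in (7) can be sketched as follows. The variant of the two-sample T-test is not specified in the text, so the Welch (unequal-variance) statistic and the sample standard deviation used here are assumptions, as are the function names and the toy data.

```python
import math
from statistics import mean, stdev, variance

def two_sample_t(a, b):
    """Two-sample t statistic in Welch form (an assumption of this sketch)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def s_ttest_weights(X, y):
    """Standardize the absolute t statistics of all genes: (|T_j| - mu) / sigma."""
    n_genes = len(X[0])
    abs_t = []
    for j in range(n_genes):
        cases = [row[j] for row, label in zip(X, y) if label == 1]
        controls = [row[j] for row, label in zip(X, y) if label == 0]
        abs_t.append(abs(two_sample_t(cases, controls)))
    mu = mean(abs_t)
    sigma = stdev(abs_t)  # sample standard deviation; also an assumption
    return [(t - mu) / sigma for t in abs_t]

# Toy data (hypothetical): 6 samples, 3 genes; gene 0 differs most between groups.
X = [[5.0, 1.0, 2.0], [5.2, 1.2, 2.1], [4.8, 0.9, 1.9],
     [1.0, 1.1, 2.3], [1.2, 0.8, 2.0], [0.8, 1.0, 2.2]]
y = [1, 1, 1, 0, 0, 0]
weights = s_ttest_weights(X, y)
```

Note that standardization centers the weights around zero, so genes whose absolute T-test value lies below the mean receive negative weights under this reading of (7).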


[Figure 1: for each gene (Gene 1, Gene 2, ..., Gene d), (1) the absolute two-sample t-test value $|T_j|$ is
computed from its expression values across the samples; (2) the mean value of the absolute T-test of all
genes is computed; (3) the S-Ttest weight $W_j$ of each gene is obtained.]

Figure 1. Example of S-Ttest weight calculations of genes


As illustrated in the figure, the weight of each gene is computed by subtracting the mean value from
the gene's absolute T-test value and then dividing the result by the standard deviation. Genes that show no
differential expression between the two groups, that is, genes whose p-values from the two-sample test are
greater than 0.05, are then filtered out. Finally, the expression values of each remaining gene are multiplied
by the corresponding weight. The result is the weighted gene expression matrix for the microarray, as shown
in Figure 2.
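The filtering and weighting steps just described might look like the sketch below. It takes the weights and the per-gene p-values as given inputs; the function name, the argument names, and the toy numbers are hypothetical.

```python
def weight_expression(X, weights, p_values, threshold=0.05):
    """Drop genes with p-value above the threshold, then scale each kept gene.

    X: list of samples (rows), each a list of gene expression values.
    weights, p_values: one entry per gene (column).
    Returns the kept gene indices and the weighted expression matrix.
    """
    keep = [j for j, p in enumerate(p_values) if p <= threshold]
    # The same weight is applied to a gene across all samples.
    X_weighted = [[weights[j] * row[j] for j in keep] for row in X]
    return keep, X_weighted

# Toy example: the second gene is not significant and is removed.
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weights = [2.0, 0.5, -1.0]
p_values = [0.01, 0.20, 0.04]
kept, X_w = weight_expression(X, weights, p_values)
```

Returning the kept indices alongside the matrix makes it possible to map selected coefficients back to the original gene identifiers later.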


[Figure 2: the default expression data matrix (rows are samples, columns are genes, entries $t_{ij}$) is
multiplied column-wise by the weights for the $d$ genes, $[W_1, \ldots, W_d]$, to give the weighted expression
data with entries $W_j t_{ij}$; the same weight is used for a gene throughout the samples.]

Figure 2. Processing the gene expression for bipolar disorder data with S-Ttest weight


2.4. Experimental setting 

The bipolar dataset [27] consists of 30 bipolar disorder samples and 31 control samples. Gene expressions are
measured for 22283 genes using Affymetrix technology. To show the improvement of the proposed method
over other methods, experiments were carried out and compared with SLR-lasso and SLR-SCAD. Initially, the
dataset was randomly split into training and test data, with 70% of the samples chosen for the training
group and 30% for the testing group.

To reduce the effect of the partitioning and to ensure a fair comparison, the classification accuracy
used to evaluate all competitor methods was averaged over 20 random partitions,
with 10-fold cross-validation applied to the training group in each partition. The tuning parameter $\lambda$ was
selected based on the training group, while the constant of the SCAD penalty was set to 3.7, as proposed by
Fan and Li (2001) [22].
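The partitioning protocol above can be sketched as index bookkeeping. This is a simplified illustration: the splits here are not stratified by class, the seed is arbitrary, and the function name is hypothetical; none of these details are given in the text.

```python
import random

def repeated_splits(n_samples, n_repeats=20, train_fraction=0.7, n_folds=10, seed=0):
    """Yield (train, test, folds) index lists for each random partition."""
    rng = random.Random(seed)
    n_train = round(train_fraction * n_samples)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        # Fold k takes every n_folds-th training index starting at position k.
        folds = [train[k::n_folds] for k in range(n_folds)]
        yield train, test, folds

# 61 samples in total (30 bipolar disorder + 31 controls).
splits = list(repeated_splits(61))
first_train, first_test, first_folds = splits[0]
```

In each partition, the folds would be used to tune the penalty parameter on the training group only, and the held-out test indices would be touched once for the final accuracy.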


2.5. Performance evaluation criteria 

Two performance indicators, classification accuracy (CA) and the geometric mean of specificity and
sensitivity (geo-mean), are considered in evaluating the proposed method. CA is the
percentage of patients and healthy persons that are correctly classified; this measure is used to evaluate the
power of the classifier. It is calculated as:


$$CA = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%, \tag{8}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false
positives, and FN is the number of false negatives. Maximizing the accuracy in both classes is a goal of
common classification methods. The second metric, geo-mean, has been proposed to show the joint
performance of specificity and sensitivity, and is defined as:


$$\text{Geo-mean} = \sqrt{\text{specificity} \times \text{sensitivity}}, \tag{9}$$

where specificity (TN rate) is the proportion of correctly classified healthy persons, and sensitivity (TP rate)
is the proportion of correctly classified patients [25].
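Both indicators in (8) and (9) follow directly from the four confusion-matrix counts. The sketch below uses hypothetical counts purely for illustration; the function names are not from the paper.

```python
import math

def classification_accuracy(tp, tn, fp, fn):
    """Percentage of correctly classified samples, as in (8)."""
    return (tp + tn) / (tp + tn + fp + fn) * 100.0

def geo_mean(tp, tn, fp, fn):
    """Geometric mean of specificity (TN rate) and sensitivity (TP rate), as in (9)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(specificity * sensitivity)

# Hypothetical counts: 8 patients and 9 healthy persons classified correctly.
ca = classification_accuracy(tp=8, tn=9, fp=1, fn=2)
gm = geo_mean(tp=8, tn=9, fp=1, fn=2)
```

Unlike CA, the geo-mean drops to zero whenever one class is entirely misclassified, which is why it complements accuracy when the two groups matter equally.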


2.6. Bipolar disorder gene expression data classification 

The preprocessed gene expression data (the weighted gene matrix) are used with SLR to obtain higher
classification accuracy than traditional methods. The weighted SLR (WSLR) algorithm
is as follows:

Algorithm 1:
1. Apply a two-sample T-test to extract the significant genes.
2. Find $W_j$ using (7) (Figure 1).
3. Define $\tilde{t}_j = W_j \cdot t_j$ (Figure 2).
4. Compute $\hat{\beta}_{WSLR}$ using (6).
3. RESULTS AND DISCUSSION
3.1. Classification indicator 

Table 1 summarizes the classification accuracy of the proposed technique together with the results of
SLR-SCAD and SLR-lasso. The table also shows the number of selected genes and
the geo-mean for the training dataset. Each value in the table is accompanied by its standard
deviation in parentheses. On the training group, WSLR achieves a classification accuracy of 91.936,
outperforming SLR-SCAD by 6.08% and SLR-lasso by 12.935%, both measured relative to the WSLR accuracy.


Table 1. Classification indicators of the proposed technique and its rivals

Techniques   Train                                              Test
             #Genes   CA               Geo-mean                 CA
WSLR         22       91.936 (1.878)   0.895 (0.034)            86.817 (3.416)
SLR-SCAD     29       86.345 (2.756)   0.846 (0.009)            80.115 (1.459)
SLR-lasso    35       80.044 (4.818)   0.797 (0.003)            76.887 (3.309)


The geo-mean of WSLR reaches 0.895, which shows that WSLR clearly distinguishes
between healthy individuals and individuals with BD. In addition, SLR-SCAD yields a training classification
accuracy of 86.345, which is better than that of SLR-lasso. This result is expected because SLR-SCAD is a
more consistent selector. Moreover, WSLR yields better test classification accuracy than the other
techniques: it achieves 86.817, which is 7.7197% and 11.438% better than SLR-SCAD and SLR-lasso,
respectively. This superiority is accompanied by a smaller number of genes: our approach
selects 22 genes versus 29 and 35 for SLR-SCAD and SLR-lasso, respectively.
Overall, the proposed technique obtained the best classification
performance compared to SLR-SCAD and SLR-lasso, which indicates that it is useful for
providing more information about the influence of genes on bipolar disorder.
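The percentage gains quoted in this section can be reproduced from the accuracies in Table 1 if each gap is divided by the WSLR value. That reading is an inference from the reported numbers rather than something stated in the text, and the helper name below is hypothetical.

```python
def relative_gain(wslr, rival):
    """Gap between WSLR and a rival, as a percentage of the WSLR value."""
    return (wslr - rival) / wslr * 100.0

# Training-group accuracies from Table 1.
train_vs_scad = relative_gain(91.936, 86.345)
train_vs_lasso = relative_gain(91.936, 80.044)

# Test-group accuracies from Table 1.
test_vs_scad = relative_gain(86.817, 80.115)
test_vs_lasso = relative_gain(86.817, 76.887)
```

Under this convention the four values agree with the quoted 6.08%, 12.935%, 7.7197%, and 11.438% up to rounding; the corresponding absolute gaps are 5.591, 11.892, 6.702, and 9.930 percentage points.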


3.2. Classification indicator 

To emphasize how WSLR affects classification performance by selecting the most
relevant genes, a pairwise comparison between WSLR and the competitor techniques was carried out using a
t-test on the area under the curve (AUC) of the training data. Figure 3 presents the AUC boxplot
of the proposed technique and its rivals. The AUC of the proposed technique is clearly better than
that of the competing techniques. Table 2 reports the test results at significance level $\alpha = 0.05$.
WSLR is significantly better than both SLR-SCAD and SLR-lasso.
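For readers who want to reproduce this kind of comparison, the sketch below computes an AUC from predicted scores and a t statistic on the per-partition AUC differences of two methods. Treating the comparison as a paired test is an assumption, since the text does not say which form of the t-test was used; the function names and AUC values are invented for illustration, and a p-value would additionally require a t-distribution CDF, which is omitted here.

```python
import math
from statistics import mean, stdev

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def paired_t_statistic(a, b):
    """t statistic of the mean paired difference between two result lists."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical scores for four samples (two patients, two controls).
example_auc = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])

# Hypothetical per-partition AUCs for two methods.
method_a = [0.90, 0.92, 0.88, 0.91]
method_b = [0.85, 0.86, 0.84, 0.85]
t_stat = paired_t_statistic(method_a, method_b)
```

A large positive statistic indicates that the first method's AUC is consistently higher across partitions, which is the pattern Table 2 reports for WSLR.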


[Figure 3: boxplots of AUC (vertical axis) for WSLR, SLR-SCAD, and SLR-lasso on the bipolar dataset.]

Figure 3. AUC boxplot for the bipolar-disorder dataset achieved using (weighted/non-weighted) sparse
logistic regression


Bulletin of Electr Eng & Inf, Vol. 11, No. 2, April 2022: 1062-1068 


Bulletin of Electr Eng & Inf ISSN: 2302-9285


Table 2. P-values for the t-test of the proposed technique and its rivals

Dataset            WSLR vs SLR-SCAD   WSLR vs SLR-lasso
Bipolar-disorder   0.0029 (*)         0.0009 (*)

(*) significant differences


4. CONCLUSION 

In this work, sparse logistic regression was applied after weighting the data. The main objective of the
weighting is to identify the relevant genes in bipolar disorder data and to detect the influence of genes on
bipolar disorder. The proposed technique, WSLR, clearly outperformed the competing techniques in both
classification and gene selection. Three comparative aspects, classification accuracy, geo-mean,
and AUC, were considered to show the classification performance of WSLR. Taken together,
these three aspects make WSLR a favorable gene selection method. Overall, WSLR
shows utility and potential applicability to the classification of other large datasets in the psychological
disease domain and to other fields, such as feature selection in pattern recognition or variable selection in
statistics.